Don't send on unbuffered channel while holding a lock #397
Conversation
Longhorn 7919
Signed-off-by: Eric Weber <eric.weber@suse.com>

73660c6 to 9f14426
I tested this by following the reproduction steps from longhorn/longhorn#7919 (comment) with a modified instance-manager that had these changes AND ...
LGTM
A side question: do we observe high memory and CPU consumption once many goroutines are stuck in this state? It might show up if we leave the deadlock in place for a long time and the goroutines keep building up.
Nice.
Good question. I put my cluster into the stuck state and am monitoring it. For now CPU is stable and low, but it looks like memory is slowly leaking in the instance-manager pod as we build up goroutines like the following:
ProcessDelete keeps getting called because we are trying to detach volumes. Each server goroutine that is supposed to answer the request gets hung on the deadlocked RWMutex. Memory has gone from 445 MiB to 549 MiB in an hour and a half or so.
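For illustration, here is a minimal, self-contained Go sketch of this failure mode. The `processManager`, `broadcastWhileLocked`, and `processDelete` names are simplified stand-ins, not the actual instance-manager code: one goroutine sends on an unbuffered channel while still holding the RWMutex, no receiver ever arrives, and every later caller that needs the same mutex (such as a ProcessDelete handler) piles up behind it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Simplified stand-in for the process manager: a map of processes guarded
// by an RWMutex, plus an unbuffered channel used to push condition updates.
type processManager struct {
	lock      sync.RWMutex
	processes map[string]string
	updateCh  chan string // unbuffered: a send blocks until someone receives
}

// Buggy pattern: the lock is still held while sending on the unbuffered
// channel. If no receiver is ready, this goroutine blocks forever with the
// read lock held, and every later Lock()/RLock() caller hangs behind it.
func (pm *processManager) broadcastWhileLocked() {
	pm.lock.RLock()
	defer pm.lock.RUnlock()
	for name := range pm.processes {
		pm.updateCh <- name // blocks here forever if the receiver is gone
	}
}

// A typical handler that needs the same lock, e.g. a ProcessDelete request.
func (pm *processManager) processDelete(name string) {
	pm.lock.Lock() // hangs once broadcastWhileLocked is stuck with RLock held
	defer pm.lock.Unlock()
	delete(pm.processes, name)
}

func main() {
	pm := &processManager{
		processes: map[string]string{"instance-1": "running"},
		updateCh:  make(chan string), // no buffer, and nobody ever receives
	}

	go pm.broadcastWhileLocked()      // gets stuck on the channel send
	go pm.processDelete("instance-1") // piles up behind the held RWMutex

	time.Sleep(time.Second)
	fmt.Println("both goroutines are now blocked; every repeated request",
		"adds another blocked goroutine and memory slowly grows")
}
```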
@mergify backport v1.6.x
✅ Backports have been created
@mergify backport v1.5.x v1.4.x
✅ Backports have been created
In this case, this means the process manager's lock (and locks in general) should only be held for a very short time, and not across blocking operations, to prevent any potential scalability issue.
Which issue(s) this PR fixes:
longhorn/longhorn#7919
What this PR does / why we need it:
There is a potential deadlock caused by `pm.getProcessesToUpdateConditions`. See the linked issue for details. We only need the ProcessManager lock while we are iterating through the Process map. We should not continue to hold it while sending on an unbuffered channel.
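A minimal sketch of that fix, using the same simplified, hypothetical `processManager` type as the earlier illustration (the real method is `pm.getProcessesToUpdateConditions`; the names here are stand-ins): hold the lock only long enough to copy the relevant entries out of the process map, release it, and only then send on the unbuffered channel. With this shape, a slow or absent receiver can delay the condition updates, but it can no longer block every other lock user.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for the process manager; not the real instance-manager types.
type processManager struct {
	lock      sync.RWMutex
	processes map[string]string
	updateCh  chan string // unbuffered
}

// Fixed pattern: hold the RWMutex only while snapshotting the process map,
// then release it before sending on the unbuffered channel.
func (pm *processManager) broadcastOutsideLock() {
	pm.lock.RLock()
	names := make([]string, 0, len(pm.processes))
	for name := range pm.processes {
		names = append(names, name)
	}
	pm.lock.RUnlock()

	// The lock is released; even if these sends block, other goroutines
	// (e.g. a ProcessDelete handler) can still acquire the mutex.
	for _, name := range names {
		pm.updateCh <- name
	}
}

func main() {
	pm := &processManager{
		processes: map[string]string{"instance-1": "running"},
		updateCh:  make(chan string),
	}

	// A receiver for the updates; in the real code this would be whatever
	// consumes condition updates on the other side of the channel.
	done := make(chan struct{})
	go func() {
		for name := range pm.updateCh {
			fmt.Println("condition update for", name)
		}
		close(done)
	}()

	pm.broadcastOutsideLock()
	close(pm.updateCh)
	<-done
}
```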